
    Fair Column Subset Selection

    We consider the problem of fair column subset selection. In particular, we assume that two groups are present in the data, and the chosen column subset must provide a good approximation for both, relative to their respective best rank-k approximations. We show that this fair setting introduces significant challenges: in order to extend known results, one cannot do better than the trivial solution of simply picking twice as many columns as the original methods. We adopt a known approach based on deterministic leverage-score sampling, and show that merely sampling a subset of appropriate size becomes NP-hard in the presence of two groups. Whereas finding a subset of twice the desired size is trivial, we provide an efficient algorithm that achieves the same guarantees with essentially 1.5 times that size. We validate our methods through an extensive set of experiments on real-world data.
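    As background for the leverage-score machinery the abstract builds on, the sketch below computes rank-k leverage scores and selects columns deterministically. This is the standard single-group baseline, not the paper's fair algorithm, and all function names are ours:

```python
import numpy as np

def leverage_scores(A, k):
    """Rank-k leverage scores of the columns of A.

    The score of column j is the squared Euclidean norm of the j-th
    column of V_k^T, where V_k holds the top-k right singular vectors.
    The scores always sum to k.
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]                       # k x n
    return np.sum(Vk ** 2, axis=0)

def deterministic_css(A, k, c):
    """Pick the c columns of A with the largest rank-k leverage scores."""
    scores = leverage_scores(A, k)
    return np.argsort(scores)[::-1][:c]

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
cols = deterministic_css(A, k=5, c=8)
```

    The paper's hardness result concerns choosing such a subset when two groups must be approximated simultaneously; the single-group selection above is the easy case.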

    Off-the-grid: Fast and Effective Hyperparameter Search for Kernel Clustering

    Kernel functions are a powerful tool to enhance the k-means clustering algorithm via the kernel trick. It is known that the parameters of the chosen kernel function can have a dramatic impact on the result. In supervised settings, these can be tuned via cross-validation, but for clustering this is not straightforward and heuristics are usually employed. In this paper we study the impact of kernel parameters on kernel k-means. In particular, we derive a lower bound, tight up to constant factors, below which the parameter of the RBF kernel will render kernel k-means meaningless. We argue that grid search can be ineffective for hyperparameter search in this context and propose an alternative algorithm for this purpose. In addition, we offer an efficient implementation based on fast approximate exponentiation with provable quality guarantees. Our experimental results demonstrate the ability of our method to efficiently reveal a rich and useful set of hyperparameter values.
    Comment: ECML-PKDD 202
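    For reference, kernel k-means can be run entirely from the kernel matrix, since the squared feature-space distance to a cluster mean expands in terms of kernel entries. The sketch below is a minimal illustration of that mechanism; the RBF bandwidth gamma is an arbitrary illustrative value, not the bound or search procedure from the paper:

```python
import numpy as np

def rbf_kernel(X, gamma):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_kmeans(K, k, n_iter=100, seed=0):
    # Lloyd-style iterations using only the kernel matrix:
    # ||phi(x_i) - mu_c||^2 = K_ii - 2*mean_j K_ij + mean_{j,l} K_jl,
    # with j, l ranging over the points currently in cluster c.
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)
    for _ in range(n_iter):
        D = np.full((n, k), np.inf)      # inf keeps empty clusters unselectable
        for c in range(k):
            idx = labels == c
            if not idx.any():
                continue
            D[:, c] = (np.diag(K)
                       - 2.0 * K[:, idx].mean(axis=1)
                       + K[np.ix_(idx, idx)].mean())
        new_labels = D.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# two well-separated blobs; a gamma on the wrong scale would flatten K
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(8, 0.3, (10, 2))])
labels = kernel_kmeans(rbf_kernel(X, gamma=0.5), k=2)
```

    The paper's point is precisely that choices of gamma far below a data-dependent threshold make the matrix K nearly constant, so iterations like the above become meaningless.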

    Diversity-aware k-median: Clustering with fair center representation

    We introduce a novel problem for diversity-aware clustering. We assume that the potential cluster centers belong to a set of groups defined by protected attributes, such as ethnicity, gender, etc. We then ask to find a minimum-cost clustering of the data into k clusters so that a specified minimum number of cluster centers is chosen from each group. We thus require that all groups are represented in the clustering solution as cluster centers, according to specified requirements. More precisely, we are given a set of clients C, a set of facilities F, a collection {F_1, ..., F_t} of facility groups F_i ⊆ F, a budget k, and a set of lower-bound thresholds R = {r_1, ..., r_t}, one for each group. The diversity-aware k-median problem asks to find a set S of k facilities such that |S ∩ F_i| ≥ r_i, that is, at least r_i centers in S are from group F_i, and the k-median cost Σ_{c ∈ C} min_{s ∈ S} d(c, s) is minimized. We show that in the general case, where the facility groups may overlap, the diversity-aware k-median problem is NP-hard, fixed-parameter intractable, and inapproximable to any multiplicative factor. On the other hand, when the facility groups are disjoint, approximation algorithms can be obtained by reduction to the matroid median and red-blue median problems. Experimentally, we evaluate our approximation methods for the tractable cases, and present a relaxation-based heuristic for the theoretically intractable case, which can provide high-quality and efficient solutions for real-world datasets.
    Comment: To appear in ECML-PKDD 202
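    The two ingredients of the problem definition, the k-median cost and the group lower bounds, are easy to evaluate for a candidate solution. The helpers below are illustrative names of ours, not code from the paper:

```python
import numpy as np

def kmedian_cost(clients, facilities, S):
    # sum over clients of the distance to the nearest open facility in S
    d = np.linalg.norm(clients[:, None, :] - facilities[S][None, :, :], axis=2)
    return d.min(axis=1).sum()

def is_feasible(S, groups, r, k):
    # |S| = k and at least r_i chosen facilities belong to group F_i
    S = set(S)
    return len(S) == k and all(len(S & set(F_i)) >= r_i
                               for F_i, r_i in zip(groups, r))

clients = np.array([[0.0, 0.0], [1.0, 0.0]])
facilities = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 0.0], [9.0, 9.0]])
groups = [[0, 1], [2, 3]]   # two facility groups (disjoint in this toy case)
r = [1, 1]                  # at least one center required from each group
S = [0, 2]                  # candidate: one facility from each group
```

    With disjoint groups, as here, the abstract notes the problem reduces to matroid median / red-blue median; overlapping groups are what make it inapproximable.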

    Provable randomized rounding for minimum-similarity diversification

    When searching for information in a data collection, we are often interested not only in finding relevant items, but also in assembling a diverse set, so as to explore different concepts that are present in the data. This problem has been researched extensively. However, finding a set of items with minimal pairwise similarities can be computationally challenging, and most existing works striving for quality guarantees assume that item relatedness is measured by a distance function. Given the widespread use of similarity functions in many domains, we believe this to be an important gap in the literature. In this paper we study the problem of finding a diverse set of items, when item relatedness is measured by a similarity function. We formulate the diversification task using a flexible, broadly applicable minimization objective, consisting of the sum of pairwise similarities of the selected items and a relevance penalty term. To find good solutions we adopt a randomized rounding strategy, which is challenging to analyze because of the cardinality constraint present in our formulation. Even though this obstacle can be overcome using dependent rounding, we show that it is possible to obtain provably good solutions using an independent approach, which is faster, simpler to implement and completely parallelizable. Our analysis relies on a novel bound for the ratio of Poisson-Binomial densities, which is of independent interest and has potential implications for other combinatorial-optimization problems. We leverage this result to design an efficient randomized algorithm that provides a lower-order additive approximation guarantee. We validate our method using several benchmark datasets, and show that it consistently outperforms the greedy approaches that are commonly used in the literature.
    Peer reviewed
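    To make the setup concrete, the sketch below shows one plausible instantiation of such an objective (pairwise similarity inside the set plus a penalty on the relevance mass left out) together with independent rounding of a fractional solution, followed by a naive repair step to meet the cardinality constraint. The exact objective, rounding and analysis in the paper may differ; all names here are ours:

```python
import numpy as np

def objective(S, sim, rel, lam):
    # sum of pairwise similarities inside S, plus lam times the
    # relevance of the items excluded from S (one possible penalty)
    idx = np.array(sorted(S))
    pairwise = (sim[np.ix_(idx, idx)].sum() - sim[idx, idx].sum()) / 2.0
    mask = np.zeros(len(rel), dtype=bool)
    mask[idx] = True
    return pairwise + lam * rel[~mask].sum()

def independent_round(x, k, rng):
    # include item i independently with probability x_i, then repair
    # to exactly k items, preferring items with larger fractional value
    chosen = rng.random(len(x)) < x
    order = np.argsort(-x)
    S = [int(i) for i in order if chosen[i]][:k]   # trim any surplus
    for i in order:                                 # fill any deficit
        if len(S) == k:
            break
        if int(i) not in S:
            S.append(int(i))
    return S

sim = np.array([[0.0, 1.0], [1.0, 0.0]])
rel = np.array([1.0, 1.0])
```

    Independent rounding keeps the set size only near k in expectation; the paper's contribution is showing the deviation from the constraint can be controlled without resorting to dependent rounding.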

    Recent results and open problems in spectral algorithms for signed graphs

    openaire: EC/H2020/871042/EU//SoBigData-PlusPlus
    In signed graphs, edges are labeled with either a positive or a negative sign. This small modification greatly enriches the representation capabilities of graphs. However, their spectral properties undergo significant changes, introducing new challenges in related optimization problems. In this extended abstract we discuss recent results in spectral methods for signed graph partitioning and community detection, and propose open problems arising in this context.
    Peer reviewed
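    A standard spectral object in this setting is the signed Laplacian, which uses absolute degrees so that negative edges are not cancelled. A minimal sketch: for a balanced signed graph the smallest eigenvalue is zero and the signs of the corresponding eigenvector recover the two sides of the partition (this illustrates the classical construction, not the specific results surveyed in the abstract):

```python
import numpy as np

def signed_laplacian(A):
    # A: symmetric adjacency matrix with entries in {-1, 0, +1};
    # degrees are taken on absolute values, so L is positive semidefinite
    Dbar = np.diag(np.abs(A).sum(axis=1))
    return Dbar - A

# triangle: nodes 0 and 1 are friends, both antagonistic to node 2
A = np.array([[0.0,  1.0, -1.0],
              [1.0,  0.0, -1.0],
              [-1.0, -1.0, 0.0]])
L = signed_laplacian(A)
w, V = np.linalg.eigh(L)                 # eigenvalues in ascending order
labels = (V[:, 0] >= 0).astype(int)      # signs of the bottom eigenvector
```

    For unbalanced graphs the smallest eigenvalue is strictly positive, and how to best exploit the spectrum is exactly where the open problems mentioned above arise.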

    Column subset selection in practice: efficient heuristics and regularization

    Today, data are available at an unprecedented scale. An overwhelming quantity of Internet-connected devices generates a constant stream of information all over the world, much of which is processed in real time or stored for later use. Making sense of these enormous data sets is often a challenging endeavour. Their size demands the use of massive computational resources, which motivates the design of efficient algorithms. Additionally, these data usually contain measurements of a large number of variables, which poses a wide variety of problems. To address the latter, a family of techniques commonly referred to as dimensionality reduction is studied. In this thesis we address the problem of feature selection, a subset of dimensionality reduction methods that preserve the semantic meaning of the original data variables. To do so, we analyze a problem formulation known as column subset selection. A significant advantage of column subset selection is that the models it produces are simple and in some cases easy to interpret. In an age where notable advances in applied computer science are met with growing concerns about ethics and transparency, model simplicity can become a key requirement in many scenarios. The column subset selection problem has received significant attention in the computer science literature over the last few years, mainly from a theoretical perspective. Here we analyze the problem from a more practical standpoint. Our contributions can be summarized as follows. First, we propose the use of a local search heuristic. We show empirically that it outperforms existing algorithms and derive elementary approximation guarantees. Furthermore, we take advantage of the nature of the problem formulation to derive an efficient implementation suitable for practical use. Second, we introduce regularized formulations of the problem. We derive a greedy algorithm for these new objectives and demonstrate empirically that it produces improved subsets with respect to multiple criteria.
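    To illustrate the kind of local search heuristic the abstract refers to, the sketch below minimizes the Frobenius reconstruction error of a column subset by single-column swaps. It is a generic, unoptimized version under our own naming, not the thesis's efficient implementation:

```python
import numpy as np

def css_error(A, S):
    # squared Frobenius error of projecting A onto the span of columns S
    C = A[:, list(S)]
    P = C @ np.linalg.pinv(C)
    return np.linalg.norm(A - P @ A) ** 2

def local_search_css(A, k, max_iter=100):
    # start from an arbitrary subset and swap single columns
    # while any swap strictly reduces the reconstruction error
    n = A.shape[1]
    S = list(range(k))
    best = css_error(A, S)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for j in range(n):
                if j in S:
                    continue
                T = S.copy()
                T[i] = j
                e = css_error(A, T)
                if e < best - 1e-12:
                    S, best, improved = T, e, True
        if not improved:
            break
    return sorted(S), best

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 6))
S, err = local_search_css(A, k=2)
```

    Each pass costs one pseudoinverse per candidate swap; exploiting the structure of the projection to avoid recomputing it from scratch is the kind of refinement the thesis develops for practical use.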

    Reconciliation k-median

    openaire: EC/H2020/654024/EU//SoBigData
    We propose a new variant of the k-median problem, where the objective function models not only the cost of assigning data points to cluster representatives, but also a penalty term for disagreement among the representatives. We motivate this novel problem by applications where we are interested in clustering data while avoiding selecting representatives that are too far from each other. For example, we may want to summarize a set of news sources, but avoid selecting ideologically extreme articles in order to reduce polarization. To solve the proposed k-median formulation we adopt the local-search algorithm of Arya et al. [2]. We show that the algorithm provides a provable approximation guarantee, which becomes constant under a mild assumption on the minimum number of points for each cluster. We experimentally evaluate our problem formulation and proposed algorithm on datasets inspired by the motivating applications. In particular, we experiment with data extracted from Twitter, the US Congress voting records, and popular news sources. The results show that our objective can lead to choosing less polarized groups of representatives without significant loss in representation fidelity.
    Peer reviewed
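    One plausible concrete form of such an objective adds the pairwise distances among the chosen representatives, weighted by a trade-off parameter, to the usual k-median assignment cost. The sketch below is our own illustration of that shape, not the exact formulation or weighting from the paper:

```python
import numpy as np

def reconciliation_cost(points, S, lam=1.0):
    # k-median assignment cost plus a disagreement penalty:
    # lam scales the sum of pairwise distances among the representatives
    reps = points[S]
    assign = np.linalg.norm(points[:, None, :] - reps[None, :, :],
                            axis=2).min(axis=1).sum()
    pairwise = np.linalg.norm(reps[:, None, :] - reps[None, :, :],
                              axis=2).sum() / 2.0
    return assign + lam * pairwise

# two coincident points and one distant point
points = np.array([[0.0, 0.0], [0.0, 0.0], [4.0, 0.0]])
```

    With lam = 0 this reduces to plain k-median; increasing lam pushes the solution toward representatives that sit close together, which is the polarization-avoidance effect described above.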